Social media has become one of the main sources of information for the public, allowing people to seek and spread information quickly and easily. However, the rise of social media platforms has also enabled the proliferation of online misinformation. In particular, misinformation in the health domain has a significant impact on society, as the COVID-19 infodemic showed. Health misinformation on social media has therefore become an emerging research direction that attracts increasing attention from researchers across disciplines. Compared with misinformation in other domains, health misinformation differs in its potential to cause actual harm to people's bodies and even lives, in how hard it is for laypeople to identify, and in its deep connection with medical science. In addition, health misinformation on social media differs from that on conventional channels such as television along multiple dimensions, including its generation, dissemination, and consumption paradigms. Given the uniqueness and importance of combating health misinformation on social media, we conduct this survey to facilitate interdisciplinary research on the problem. We present a comprehensive review of existing research on online health misinformation across disciplines and systematically organize the related literature from three perspectives: characterization, detection, and intervention. Lastly, we discuss in depth the pressing open issues in combating health misinformation on social media and outline future directions for multidisciplinary researchers.
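Chapter generation has become a practical technique for online videos. Chapter breakpoints let users quickly find the parts they want and obtain summarizing annotations. However, there is no public method or dataset for this task. To facilitate research in this direction, we introduce a new dataset called Chapter-Gen, which consists of approximately 10K user-generated videos with annotated chapter information. Our data collection procedure is fast, scalable, and requires no additional manual annotation. On top of this dataset, we design an effective baseline specifically for the video chapter generation task. It captures two aspects of a video, visual dynamics and narration text, and it uses local and global video features for localization and title generation, respectively. To parse long videos efficiently, a skip sliding window mechanism is designed to localize potential chapters, and a cross-attention multimodal fusion module is developed to aggregate local features for title generation. Our experiments show that the proposed framework achieves superior results over existing methods, which indicates that method designs for similar tasks cannot be transferred directly, even after fine-tuning. Code and dataset are available at https://github.com/czt117/mvcg.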
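Machine learning models have shown promising performance in many areas. However, concerns that they may be biased against specific groups hinder their adoption in high-stakes applications. It is therefore essential to ensure fairness in machine learning models. Most previous efforts require access to sensitive attributes to mitigate bias. Yet obtaining large-scale data with sensitive attributes is often infeasible because of growing awareness of privacy and legal compliance. An important research question is therefore how to make fair predictions under privacy. In this paper, we study a novel problem of fair classification in a semi-private setting, where most of the sensitive attributes are private and only a small number of clean sensitive attributes are available. To this end, we propose a novel framework, FairSP, which first learns to correct the noisy sensitive attributes under a privacy guarantee by exploiting the limited clean sensitive attributes. It then jointly models the corrected and clean data in an adversarial way for debiasing and prediction. Theoretical analysis shows that the proposed model can ensure fairness when most of the sensitive attributes are private. Experimental results on real-world datasets demonstrate the effectiveness of the proposed model in making fair predictions under privacy while maintaining high accuracy.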
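Graph outlier detection is an emerging but crucial machine learning task with numerous applications. Despite the proliferation of algorithms in recent years, the lack of a standard and unified setting for performance evaluation limits their advancement and use in real-world applications. To fill this gap, we present what is, to the best of our knowledge, the first comprehensive unsupervised node outlier detection benchmark for graphs, called UNOD, with the following highlights: (1) evaluating 14 methods with backbones ranging from classical matrix factorization to the latest graph neural networks; (2) benchmarking method performance with different types of injected outliers and organic outliers on real-world datasets; (3) comparing the efficiency and scalability of the algorithms by runtime and GPU memory usage on synthetic graphs at different scales. Based on analyses of the extensive experimental results, we discuss the pros and cons of current methods and point out multiple crucial and promising future research directions.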
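Modern machine learning (ML) models are becoming increasingly popular and are widely used in decision-making systems. However, studies have shown that discrimination and unfairness are critical issues in ML that hinder its adoption in high-stakes applications. Recent research on fair classifiers has drawn significant attention to developing effective algorithms that achieve both fairness and good classification performance. Despite the great success of these fairness-aware machine learning models, most existing models require sensitive attributes to preprocess the data, regularize model learning, or postprocess the predictions in order to produce fair predictions. However, sensitive attributes are often incomplete or even unavailable because of privacy, legal, or regulatory restrictions. Although we lack the sensitive attributes needed to train a fair model in the target domain, there may exist a similar domain that has sensitive attributes. It is therefore important to exploit auxiliary information from the similar domain to help improve fair classification in the target domain. In this paper, we study a novel problem of exploring domain adaptation for fair classification. We propose a new framework that simultaneously estimates the sensitive attributes while learning a fair classifier in the target domain. Extensive experiments on real-world datasets illustrate the effectiveness of the proposed model for fair classification, even when no sensitive attributes are available in the target domain.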
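Recent advances in large pre-trained language models (PLMs) have led to impressive gains on natural language understanding (NLU) tasks with task-specific fine-tuning. However, directly fine-tuning PLMs relies heavily on large amounts of labeled instances, which are usually hard to obtain. Prompt-based tuning of PLMs has proven valuable for various few-shot tasks. Existing work on prompt-based tuning for few-shot NLU tasks mainly focuses on deriving proper label words with a verbalizer or on generating prompt templates that elicit semantics from PLMs. In addition, conventional data augmentation methods have been verified to be useful for few-shot tasks. However, there are currently few data augmentation methods designed for the prompt-based tuning paradigm. We therefore study a new problem of data augmentation for prompt-based few-shot learners. Since label semantics are essential in prompt-based tuning, we propose a novel label-guided data augmentation method, PromptDA, which exploits enriched label semantic information for data augmentation. Extensive experimental results on few-shot text classification tasks show that the proposed framework achieves superior performance by effectively leveraging label semantics and data augmentation for natural language understanding.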
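Online and offline handwritten Chinese text recognition (HCTR) has been studied for decades. Early methods adopted over-segmentation-based strategies but suffered from low speed, insufficient accuracy, and the high cost of character segmentation annotations. Recently, segmentation-free methods based on connectionist temporal classification (CTC) and attention mechanisms have dominated the field of HCTR. However, people actually read text character by character, especially for ideograms such as Chinese. This raises the question: are segmentation-free strategies really the best solution to HCTR? To explore this question, we propose a new segmentation-based method for recognizing handwritten Chinese text, implemented with a simple yet efficient fully convolutional network. A novel weakly supervised learning method is proposed so that the network can be trained using only transcript annotations. The expensive character segmentation annotations required by previous segmentation-based methods can thus be avoided. Because fully convolutional networks lack context modeling, we propose a contextual regularization method that integrates contextual information into the network during the training stage, which further improves recognition performance. Extensive experiments on four widely used benchmarks, namely CASIA-HWDB, CASIA-OLHWDB, ICDAR2013, and SCUT-HCCDoc, show that our method significantly surpasses existing methods on both online and offline HCTR and runs at a considerably higher speed than CTC/attention-based methods.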
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
The development of social media user stance detection and bot detection methods relies heavily on large-scale, high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, which holds back graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built from the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we use as user features the 20 user property features with the greatest information gain, together with user tweet features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
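The feature selection step mentioned in the abstract above, ranking user property features by information gain, can be illustrated with a minimal sketch. It assumes discrete feature values and uses the textbook definition IG(Y; X) = H(Y) - H(Y | X). The feature names (`verified`, `has_bio`) and the toy labels are hypothetical, and this is not the MGTAB authors' code.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) minus the entropy of labels conditioned on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    conditional = sum(len(g) / len(labels) * entropy(g)
                      for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(features, labels, k):
    """Names of the k features with the highest information gain."""
    ranked = sorted(features,
                    key=lambda name: information_gain(features[name], labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy data: four accounts, two binary property features.
labels = ["bot", "bot", "human", "human"]
features = {
    "verified": [0, 0, 1, 1],  # separates the two classes perfectly
    "has_bio": [0, 1, 0, 1],   # carries no information about the class
}
selected = top_k_features(features, labels, k=1)
```

In this toy case `verified` is selected because it determines the label exactly, while `has_bio` leaves the label distribution unchanged. Continuous features would need to be discretized first, which this sketch does not attempt.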
As one of the prevalent methods for building automation systems, Imitation Learning (IL) shows promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in the demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the environment returns, the conventional evaluation outcome, as coefficients to build an importance map. We also conducted experiments to investigate three major questions: whether frames are equally important, how effective the importance map is, and how importance maps from different IL models are connected. The results show that R2RISE successfully distinguishes the important frames in the demonstrations.
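The masking-and-retraining loop described in the abstract above can be sketched roughly as follows. This is a RISE-style illustration under stated assumptions, not the authors' implementation: the function name `frame_importance`, the parameters `num_masks` and `keep_prob`, the per-frame normalization, and the toy `toy_train_and_evaluate` stand-in (which replaces actually retraining an IL model and measuring its environment return) are all hypothetical.

```python
import random
from typing import Callable, List, Sequence


def frame_importance(
    num_frames: int,
    train_and_evaluate: Callable[[Sequence[int]], float],
    num_masks: int = 200,
    keep_prob: float = 0.5,
    seed: int = 0,
) -> List[float]:
    """Average return over the random masks in which each frame was kept."""
    rng = random.Random(seed)
    weighted = [0.0] * num_frames  # sum of returns over masks keeping frame i
    counts = [0] * num_frames      # number of masks that kept frame i
    for _ in range(num_masks):
        # 1 keeps a demonstration frame, 0 masks it out.
        mask = [1 if rng.random() < keep_prob else 0
                for _ in range(num_frames)]
        # Assumed black box: retrain on the masked demonstrations and
        # report the environment return of the resulting policy.
        ret = train_and_evaluate(mask)
        for i, kept in enumerate(mask):
            if kept:
                weighted[i] += ret
                counts[i] += 1
    return [w / c if c else 0.0 for w, c in zip(weighted, counts)]


def toy_train_and_evaluate(mask: Sequence[int]) -> float:
    # Hypothetical stand-in: pretend only frame 2 affects the return.
    return 1.0 if mask[2] else 0.0


importance = frame_importance(5, toy_train_and_evaluate)
```

With the toy evaluator, frame 2 receives the highest score because every mask that keeps it yields the full return, while the other frames only score as often as frame 2 happens to be kept alongside them. In a real setting each call to `train_and_evaluate` is a full retraining run, so the number of masks is the main cost to budget for.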